exploration scheme
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- (5 more...)
Bandit Convex Optimization: Towards Tight Bounds
Bandit Convex Optimization (BCO) is a fundamental framework for decision making under uncertainty, which generalizes many problems from the realm of online and statistical learning. While the special case of linear cost functions is well understood, a gap in the attainable regret for BCO with nonlinear losses remains an important open question. In this paper we take a step towards understanding the best attainable regret bounds for BCO: we give an efficient and near-optimal regret algorithm for BCO with strongly-convex and smooth loss functions. In contrast to previous works on BCO that use time-invariant exploration schemes, our method employs an exploration scheme that shrinks with time.
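As one way to picture a shrinking exploration scheme, the sketch below (Python; function name, step sizes, and tuning are assumptions, not the paper's algorithm) runs a one-point bandit gradient method whose exploration radius delta_t decays with the round index.

```python
import numpy as np

def bandit_gd_shrinking_exploration(loss_oracle, dim, horizon,
                                    eta0=0.5, delta0=1.0, seed=0):
    """Illustrative one-point bandit gradient method whose exploration
    radius delta_t shrinks like t^(-1/2). Names and tuning are assumptions,
    not the paper's exact algorithm."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    for t in range(1, horizon + 1):
        delta_t = delta0 / np.sqrt(t)            # shrinking exploration radius
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)                   # uniform direction on the unit sphere
        y = x + delta_t * u                      # the point actually played
        value = loss_oracle(y)                   # only bandit (zeroth-order) feedback
        g_hat = (dim / delta_t) * value * u      # one-point gradient estimate
        x = x - (eta0 / t) * g_hat               # descent step with decaying rate
        # a projection onto the feasible set would go here in a constrained setting
    return x

# Example on a strongly-convex, smooth quadratic loss.
x_final = bandit_gd_shrinking_exploration(lambda y: float(np.sum(y ** 2)), dim=5, horizon=2000)
```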
Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks, yet aligning their behavior with human preferences remains a central challenge. A widely adopted solution is reinforcement learning with human feedback (RLHF), which fine-tunes a pretrained LLM using human preference data (Bai et al., 2022; Christiano et al., 2017; Ziegler et al., 2019). The standard RLHF pipeline involves three stages: (i) supervised fine-tuning (SFT) on human-written demonstrations to produce a baseline model; (ii) training a reward model from human preference comparisons (Bradley and Terry, 1952); and (iii) optimizing the LLM with reinforcement learning against the learned reward. This framework has been instrumental in the success of instruction-following LLMs such as InstructGPT (Ouyang et al., 2022) and ChatGPT (OpenAI, 2023), enabling models to produce responses that are more helpful, safe, and aligned with human expectations. Despite this progress, most existing RLHF implementations are offline (Azar et al., 2024; Rafailov et al., 2024; Zhao et al., 2023): the preference data is collected once from static policies, and the reward model is trained on this fixed dataset (Ivison et al., 2023; Shi et al., 2025; Zhu et al., 2024). While effective, offline RLHF has inherent limitations: it cannot adaptively explore the enormous space of natural language, leading to inefficient use of expensive human feedback. In contrast, online RLHF offers a more powerful alternative: the policy iteratively collects new preference data, updates the reward model, and improves itself based on these updates (Chen et al., 2024; Dong et al., 2024; Feng et al., 2025; Guo et al., 2024; Rosset et al., 2024; Xiong et al., 2023).
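Stage (ii) of the pipeline fits a reward model to pairwise comparisons with the Bradley-Terry model, under which the probability that the preferred response beats the rejected one is a logistic function of their score difference. A minimal sketch of the corresponding negative log-likelihood (function name and example scores are hypothetical):

```python
import numpy as np

def bradley_terry_nll(r_chosen, r_rejected):
    """Negative log-likelihood of the Bradley-Terry preference model,
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected).
    Inputs are reward-model scores for each comparison pair."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(margin) = log(1 + exp(-margin)), computed in a stable form
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Example with three comparison pairs scored by a (hypothetical) reward model.
loss = bradley_terry_nll([2.1, 0.3, 1.5], [1.0, 0.8, -0.2])
```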
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- Asia > China > Hong Kong (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
Active Learning for Skewed Data Sets
Kazerouni, Abbas, Zhao, Qi, Xie, Jing, Tata, Sandeep, Najork, Marc
Consider a sequential active learning problem where, at each round, an agent selects a batch of unlabeled data points, queries their labels and updates a binary classifier. While there exists a rich body of work on active learning in this general form, in this paper, we focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of initial training data. Both of these problems occur with surprising frequency in many web applications. For instance, detecting offensive or sensitive content in online communities (pornography, violence, and hate-speech) is receiving enormous attention from industry as well as research communities. Such problems have both the characteristics we describe -- a vast majority of content is not offensive, so the number of positive examples for such content is orders of magnitude smaller than the negative examples. Furthermore, there is usually only a small amount of initial training data available when building machine-learned models to solve such problems. To address both these issues, we propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data available. Through simulation results, we show that HAL makes significantly better choices for which points to label when compared to strong baselines like margin-sampling. Classifiers trained on the examples selected for labeling by HAL easily outperform the baselines on target metrics (like area under the precision-recall curve) given the same budget for labeling examples. We believe HAL offers a simple, intuitive, and computationally tractable way to structure active learning for a wide range of machine learning applications.
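One way such a hybrid can be realized is to fill part of each batch with margin-style exploitation (points nearest the decision boundary) and the rest with uniform exploration over the unlabeled pool. The sketch below shows that split under assumed names and a fixed explore fraction; the paper's HAL may balance the two differently.

```python
import numpy as np

def hybrid_select_batch(pos_probs, batch_size, explore_frac=0.5, seed=0):
    """Illustrative exploit/explore batch selector for pool-based active learning.
    pos_probs    : classifier P(positive) for each unlabeled point
    explore_frac : fraction of the batch drawn uniformly at random (explore);
                   the rest are points nearest the decision boundary (exploit).
    The split and the margin criterion are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    pos_probs = np.asarray(pos_probs, dtype=float)

    n_explore = int(round(explore_frac * batch_size))
    n_exploit = batch_size - n_explore

    margin = np.abs(pos_probs - 0.5)                    # distance to the decision boundary
    exploit_idx = np.argsort(margin)[:n_exploit]        # most uncertain points

    remaining = np.setdiff1d(np.arange(len(pos_probs)), exploit_idx)
    explore_idx = rng.choice(remaining, size=n_explore, replace=False)

    return np.concatenate([exploit_idx, explore_idx])

# Example: pick 4 points from a pool of 10, half by margin, half at random.
batch = hybrid_select_batch(np.random.rand(10), batch_size=4)
```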
Exploration-Enhanced POLITEX
Abbasi-Yadkori, Yasin, Lazic, Nevena, Szepesvari, Csaba, Weisz, Gellert
We study algorithms for average-cost reinforcement learning problems with value function approximation. Our starting point is the recently proposed POLITEX algorithm, a version of policy iteration where the policy produced in each iteration is near-optimal in hindsight for the sum of all past value function estimates. POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, which can be satisfied if all policies sufficiently explore the environment. Unfortunately, this assumption is often unrealistic. Motivated by the rapid growth of interest in developing policies that learn to explore their environment in the absence of rewards (also known as no-reward learning), we replace the previous assumption that all policies explore the environment with the assumption that a single, sufficiently exploring policy is available beforehand. The main contribution of the paper is the modification of POLITEX to incorporate such an exploration policy in a way that allows us to obtain a regret guarantee similar to the previous one but without requiring that all policies explore the environment. In addition to the novel theoretical guarantees, we demonstrate the benefits of our scheme on environments which are difficult to explore using simple schemes like dithering. While the solution we obtain may not achieve the best possible regret, it is the first result that shows how to control the regret in the presence of function approximation errors on problems where exploration is nontrivial. Our approach can also be seen as a way of reducing the problem of minimizing the regret to learning a good exploration policy. We believe that modular approaches like ours can be highly beneficial in tackling harder control problems.
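To make the construction concrete, the sketch below computes a POLITEX-style softmax policy over the sum of all past action-value estimates and then mixes in a given exploration policy with a small probability; the simple mixing rule is an assumption for illustration, not necessarily how the paper incorporates the exploration policy.

```python
import numpy as np

def politex_policy_with_exploration(q_sum, explore_policy, eta=0.1, mix=0.1):
    """Illustrative POLITEX-style policy: a softmax over the sum of all past
    action-value estimates, mixed with a given exploration policy.
    q_sum          : (num_states, num_actions) sum of past value estimates
                     (reward convention: larger is better)
    explore_policy : (num_states, num_actions) fixed, sufficiently exploring policy
    mix            : probability mass assigned to the exploration policy
    The mixing rule is an assumption for illustration."""
    logits = eta * np.asarray(q_sum, dtype=float)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    softmax = np.exp(logits)
    softmax /= softmax.sum(axis=1, keepdims=True)        # near-greedy on hindsight values
    return (1.0 - mix) * softmax + mix * np.asarray(explore_policy, dtype=float)

# Example: 3 states, 2 actions, uniform exploration policy.
policy = politex_policy_with_exploration(np.random.randn(3, 2), np.full((3, 2), 0.5))
```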
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)